Introduction to Information Retrieval 


Today’s lecture


= Web Crawling 
= (Near) duplicate detection 

Basic crawler operation

= Begin with known “seed” URLs
= Fetch and parse them
= Extract URLs they point to
= Place the extracted URLs on a queue
= Fetch each URL on the queue and repeat

Crawling picture

[Figure: URLs crawled and parsed, the URLs frontier, and the unseen Web]

Simple picture: complications

= Web crawling isn’t feasible with one machine
= All of the above steps distributed
= Malicious pages
= Spam pages
= Spider traps, including dynamically generated ones
= Even non-malicious pages pose challenges
= Latency/bandwidth to remote servers vary
= Webmasters’ stipulations
= How “deep” should you crawl a site’s URL hierarchy?
= Site mirrors and duplicate pages
= Politeness: don’t hit a server too often

What any crawler must do

= Be Robust: Be immune to spider traps and other malicious behavior from web servers
= Be Polite: Respect implicit and explicit politeness considerations

Explicit and implicit politeness


= Explicit politeness: specifications from 
webmasters on what portions of site can be 
crawled 


= robots.txt 

= Implicit politeness: even with no 
specification, avoid hitting any site too 
often 

Robots.txt example

= No robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
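
As a rough illustration, the rules above can be checked with Python's standard urllib.robotparser module. The host www.example.com, the page paths, and the agent name "otherbot" are made up for this sketch; only the rules themselves come from the slide.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules from the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url):
    """Return True if `agent` may fetch `url` under the rules above."""
    return parser.can_fetch(agent, url)


# Hypothetical URLs on a hypothetical host.
TEMP_URL = "http://www.example.com/yoursite/temp/draft.html"
HOME_URL = "http://www.example.com/index.html"

print(allowed("otherbot", TEMP_URL))      # blocked by the "*" rule
print(allowed("otherbot", HOME_URL))      # not covered by any Disallow line
print(allowed("searchengine", TEMP_URL))  # an empty Disallow permits everything
```

A real crawler would load the file from the site's /robots.txt rather than from a string.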

What any crawler should do


= Fetch pages of “higher quality” first 
= Continuous operation: Continue fetching 
fresh copies of a previously fetched page 


= Extensible: Adapt to new data formats, 
protocols 

Robots.txt


= Protocol for giving spiders (“robots”) limited 
access to a website, originally from 1994 


= www.robotstxt.org/robotstxt.html 


= Website announces its request on what can(not) 
be crawled 


= For a server, create a file /robots.txt 


= This file specifies access restrictions 

What any crawler should do


= Be capable of distributed operation: designed to 
run on multiple distributed machines 

= Be scalable: designed to increase the crawl rate 
by adding more machines 

= Performance/efficiency: permit full use of 
available processing and network resources 

Updated crawling picture

[Figure: URLs crawled and parsed, the URL frontier, the unseen Web, and a crawling thread]

URL frontier


= Can include multiple pages from the same 
host 


= Must avoid trying to fetch them all at the 
same time 


= Must try to keep all crawling threads busy 

Basic crawl architecture

[Figure: crawl architecture with robots filters, a “Content seen?” test, a URL filter, and the URL frontier]

Parsing: URL normalization

= When a fetched document is parsed, some of the extracted links are relative URLs
= E.g., http://en.wikipedia.org/wiki/Main_Page has a relative link to /wiki/Wikipedia:General_disclaimer, which is the same as the absolute URL http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
= During parsing, must normalize (expand) such relative URLs
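
One way to expand a relative link, sketched here with Python's standard urllib.parse.urljoin and the Wikipedia URLs from the example above:

```python
from urllib.parse import urljoin

# URL of the page that was fetched and the relative link found inside it.
base = "http://en.wikipedia.org/wiki/Main_Page"
relative = "/wiki/Wikipedia:General_disclaimer"

# Resolve the relative link against the URL of the containing page.
absolute = urljoin(base, relative)
print(absolute)
# http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
```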

Processing steps in crawling

= Pick a URL from the frontier (which one?)
= Fetch the document at the URL
= Parse the fetched document
= Extract links from it to other docs (URLs)
= Check if the fetched document has content already seen
= If not, add to indexes
= For each extracted URL
= Ensure it passes certain URL filter tests
= Check if it is already in the frontier (duplicate URL elimination)
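
The steps above can be sketched as a single-threaded loop. This is only a toy under stated assumptions: the "web" is an in-memory dictionary of made-up pages so the code runs offline, the frontier is a plain FIFO queue, links are pulled out with a simple regular expression, and the URL filter (skip anything under /private/) is hypothetical.

```python
import hashlib
import re
from collections import deque
from urllib.parse import urljoin

# A tiny in-memory "web" of hypothetical pages.
HOME = '<a href="/x">x</a> <a href="http://b.example/">b</a>'
FAKE_WEB = {
    "http://a.example/": HOME,
    "http://a.example/x": '<a href="/">home</a> <a href="/private/p">p</a>',
    "http://b.example/": HOME,  # exact duplicate of a.example's home page
    "http://a.example/private/p": "secret",
}
LINK_RE = re.compile(r'href="([^"]+)"')


def url_filter(url):
    """Hypothetical URL filter: skip anything under /private/."""
    return "/private/" not in url


def crawl(seeds):
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen_urls = set(seeds)    # duplicate URL elimination
    seen_content = set()      # fingerprints of pages already indexed
    index = []                # stand-in for the real indexes
    while frontier:
        url = frontier.popleft()              # pick a URL from the frontier
        page = FAKE_WEB.get(url)              # "fetch" the document
        if page is None:
            continue
        fingerprint = hashlib.sha1(page.encode("utf-8")).hexdigest()
        if fingerprint in seen_content:       # content already seen?
            continue
        seen_content.add(fingerprint)
        index.append(url)                     # add to indexes
        for href in LINK_RE.findall(page):    # extract links
            link = urljoin(url, href)         # normalize relative URLs
            if url_filter(link) and link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)
    return index


print(crawl(["http://a.example/"]))
# ['http://a.example/', 'http://a.example/x']
```

The duplicate page at b.example is fetched but not indexed, and the /private/ link never reaches the frontier.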

DNS (Domain Name System)

= A lookup service on the Internet
= Given a URL, retrieve the IP address of its host
= Service provided by a distributed set of servers, so lookup latencies can be high (even seconds)
= Common OS implementations of DNS lookup are blocking: only one outstanding request at a time
= Solutions
= DNS caching
= Batch DNS resolver: collects requests and sends them out together
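
A minimal sketch of the DNS caching idea, assuming a fixed time-to-live per entry. The class name DnsCache, the one-hour default, and the fake resolver used in the demonstration are all illustrative; by default the cache would wrap the blocking socket.gethostbyname call.

```python
import socket
import time


class DnsCache:
    """Caches host name -> IP address lookups for a fixed time-to-live."""

    def __init__(self, resolver=socket.gethostbyname, ttl_seconds=3600.0,
                 clock=time.monotonic):
        self._resolver = resolver   # blocking lookup function
        self._ttl = ttl_seconds     # assumed lifetime of a cache entry
        self._clock = clock
        self._entries = {}          # host -> (ip, expiry time)

    def lookup(self, host):
        now = self._clock()
        cached = self._entries.get(host)
        if cached is not None and cached[1] > now:
            return cached[0]        # cache hit: no network round trip
        ip = self._resolver(host)   # cache miss: do the slow lookup
        self._entries[host] = (ip, now + self._ttl)
        return ip


# Demonstration with a fake resolver so the sketch runs offline.
resolver_calls = []


def fake_resolver(host):
    resolver_calls.append(host)
    return "192.0.2.1"  # address from a range reserved for documentation


cache = DnsCache(resolver=fake_resolver)
first_ip = cache.lookup("www.example.com")
second_ip = cache.lookup("www.example.com")
print(first_ip, second_ip, len(resolver_calls))  # the resolver ran only once
```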

Content seen?

= Duplication is widespread on the web
= If the page just fetched is already in the index, do not further process it
= This is verified using document fingerprints or shingles
= Second part of this lecture
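
For exact duplicates, the fingerprint test might look like the following sketch. Taking the first 8 bytes of a SHA-1 digest as a 64-bit fingerprint is an assumption made for illustration, not something the lecture prescribes.

```python
import hashlib

seen_fingerprints = set()


def fingerprint(text):
    """64-bit fingerprint: first 8 bytes of a SHA-1 digest (assumed choice)."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def content_seen(text):
    """Return True if identical content was processed before; else record it."""
    fp = fingerprint(text)
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False


first = content_seen("A rose is a rose.")   # new content
second = content_seen("A rose is a rose.")  # exact duplicate
third = content_seen("A rose is a rose!")   # one character differs
print(first, second, third)
```

A fingerprint like this catches only byte-identical pages; the third call shows that a near duplicate slips through, which is why shingles are needed.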

Filters and robots.txt

= Filters: regular expressions for URLs to be crawled/not
= Once a robots.txt file is fetched from a site, need not fetch it repeatedly
= Doing so burns bandwidth, hits web server
= Cache robots.txt files
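
Both ideas can be sketched together: a regular-expression URL filter and a per-host cache of parsed robots.txt rules. The class RobotsCache, the file-extension filter, the agent name "mybot", and the fake fetch function are hypothetical; the parsing relies on Python's standard urllib.robotparser.

```python
import re
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Fetches and parses each host's robots.txt at most once."""

    def __init__(self, fetch_robots_txt):
        self._fetch = fetch_robots_txt  # callable: host -> robots.txt text
        self._parsers = {}              # host -> parsed rules

    def allowed(self, agent, url):
        host = urlsplit(url).netloc
        parser = self._parsers.get(host)
        if parser is None:              # first URL seen for this host
            parser = RobotFileParser()
            parser.parse(self._fetch(host).splitlines())
            self._parsers[host] = parser
        return parser.can_fetch(agent, url)


# Hypothetical URL filter: do not crawl image or archive files.
SKIP_RE = re.compile(r"\.(jpg|png|gif|zip)$", re.IGNORECASE)


def passes_filter(url):
    return SKIP_RE.search(urlsplit(url).path) is None


# Demonstration with a fake fetch function so the sketch runs offline.
fetched_hosts = []


def fake_fetch(host):
    fetched_hosts.append(host)
    return "User-agent: *\nDisallow: /tmp/\n"


robots = RobotsCache(fake_fetch)
results = [robots.allowed("mybot", "http://www.example.com/tmp/a"),
           robots.allowed("mybot", "http://www.example.com/b")]
print(results, len(fetched_hosts))  # two checks, one robots.txt fetch
```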

Distributing the crawler

= Run multiple crawl threads, under different processes, potentially at different nodes
= Geographically distributed nodes
= Partition hosts being crawled into nodes
= Hash used for partition
= How do these nodes communicate and share URLs?
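
A hash partition of hosts across nodes could be sketched as follows. The function name and the use of SHA-256 are assumptions; a stable hash is chosen because Python's built-in hash() for strings is randomized per process, so different nodes would disagree on the assignment.

```python
import hashlib
from urllib.parse import urlsplit


def node_for_url(url, num_nodes):
    """Map a URL to a node index in 0..num_nodes-1 based on its host only."""
    host = urlsplit(url).hostname
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


# All URLs on the same (hypothetical) host land on the same node.
print(node_for_url("http://www.example.com/a", 4))
print(node_for_url("http://www.example.com/b/c", 4))
```

Partitioning by host rather than by full URL keeps each host's politeness bookkeeping on a single node.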

URL frontier: two main considerations

= Politeness: do not hit a web server too frequently
= Freshness: crawl some pages more often than others
= E.g., pages (such as news sites) whose content changes often

These goals may conflict with each other.

(E.g., a simple priority queue fails: many links out of a page go to its own site, creating a burst of accesses to that site.)

Duplicate URL elimination

= For a non-continuous (one-shot) crawl, test to see if an extracted+filtered URL has already been passed to the frontier
= For a continuous crawl, see details of frontier implementation

Communication between nodes

= Output of the URL filter at each node is sent to the Dup URL Eliminator of the appropriate node

[Figure: distributed crawl architecture showing the URL filter]
Politeness: challenges

= Even if we restrict only one thread to fetch from a host, can hit it repeatedly
= Common heuristic: insert time gap between successive requests to a host that is >> time for most recent fetch from that host
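
The gap heuristic can be written as a one-line calculation. The slide only says the gap should be much larger than the last fetch time; the factor of 10 used below is an assumed value for illustration.

```python
def earliest_next_fetch(fetch_start, fetch_end, gap_factor=10.0):
    """Earliest time (in seconds) at which the same host may be hit again.

    gap_factor is an assumed multiplier standing in for ">>".
    """
    fetch_duration = fetch_end - fetch_start
    return fetch_end + gap_factor * fetch_duration


# A fetch that took 0.5 s and finished at t = 100.5 s.
print(earliest_next_fetch(100.0, 100.5))  # 105.5
```

Tying the gap to the observed fetch time makes the crawler back off automatically from slow or heavily loaded servers.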

URL frontier: Mercator scheme

[Figure: URLs enter the prioritizer, which feeds the front queues; the biased front queue selector and back queue router feed the back queues; the back queue selector serves the crawl thread requesting a URL]

Front queues

[Figure: the prioritizer feeds the front queues, which drain into the biased front queue selector and back queue router]

Biased front queue selector

= When a back queue requests a URL (in a sequence to be described): picks a front queue from which to pull a URL
= This choice can be round robin biased to queues of higher priority, or some more sophisticated variant
= Can be randomized
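
A randomized variant of the selector might be sketched like this. Two assumptions are made that the slide does not state: a larger priority number means higher priority, and a queue is chosen with probability proportional to its priority. The function name and the example URLs are hypothetical.

```python
import random
from collections import deque


def pick_from_front_queues(front_queues, rng=random):
    """Pull one URL from the front queues, favoring higher priorities.

    front_queues maps an integer priority to a FIFO deque of URLs.
    Returns None when every front queue is empty.
    """
    candidates = [p for p, queue in front_queues.items() if queue]
    if not candidates:
        return None
    # Assumed bias: selection probability proportional to the priority.
    priority = rng.choices(candidates, weights=candidates, k=1)[0]
    return front_queues[priority].popleft()


queues = {
    1: deque(["http://low.example/a", "http://low.example/b"]),
    3: deque(["http://high.example/a", "http://high.example/b"]),
}
print(pick_from_front_queues(queues, random.Random(0)))
```

With these weights, the priority-3 queue is chosen about three times as often as the priority-1 queue while both are non-empty.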

Mercator URL frontier


URLs flow in from the top into the frontier 
Front queues manage prioritization 
Back queues enforce politeness 


Each queue is FIFO 

Front queues (Sec. 20.2.3)

= Prioritizer assigns to URL an integer priority between 1 and K
= Appends URL to corresponding queue
= Heuristics for assigning priority
= Refresh rate sampled from previous crawls
= Application-specific (e.g., “crawl news sites more often”)
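
A toy prioritizer along these lines is sketched below. Everything specific in it is an assumption: K = 5, the convention that K is the highest priority, the linear mapping from an observed change rate to a priority, and the rule that news sites always get the top priority.

```python
from collections import deque

K = 5  # number of front queues (assumed value)


def assign_priority(change_rate, is_news_site=False, k=K):
    """Map a change rate in [0, 1] from earlier crawls to a priority in 1..k."""
    if is_news_site:             # application-specific rule
        return k
    return min(k, 1 + int(change_rate * k))


front_queues = {priority: deque() for priority in range(1, K + 1)}


def add_url(url, change_rate, is_news_site=False):
    """Append the URL to the front queue matching its priority."""
    front_queues[assign_priority(change_rate, is_news_site)].append(url)


# Hypothetical URLs and change rates.
add_url("http://static.example/about", change_rate=0.0)
add_url("http://blog.example/", change_rate=0.5)
add_url("http://news.example/", change_rate=0.1, is_news_site=True)
print({p: list(q) for p, q in front_queues.items()})
```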

Back queues

[Figure: the biased front queue selector and back queue router feed the back queues, which drain into the back queue selector]

Back queue invariants

= Each back queue is kept non-empty while the crawl is in progress
= Each back queue only contains URLs from a single host
= Maintain a table from hosts to back queues

[Table: columns “Host name” and “Back queue”]

Back queue processing

= A crawler thread seeking a URL to crawl:
= Extracts the root of the heap
= Fetches URL at head of corresponding back queue q (look up from table)
= Checks if queue q is now empty; if so, pulls a URL v from front queues
= If there’s already a back queue for v’s host, append v to it and pull another URL from front queues, repeat
= Else add v to q
= When q is non-empty, create heap entry for it
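
The procedure above can be sketched in a simplified, single-threaded form. Several things here are assumptions rather than Mercator's actual design: one plain FIFO stands in for all the front queues, the politeness gap is a fixed number of seconds, time is passed in explicitly instead of read from a clock, and the class and method names are invented.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class BackQueues:
    def __init__(self, front_urls, num_back_queues, gap_seconds=2.0):
        self.front = deque(front_urls)    # stand-in for the front queues
        self.gap = gap_seconds            # assumed fixed politeness gap
        self.queues = [deque() for _ in range(num_back_queues)]
        self.host_of = [None] * num_back_queues
        self.table = {}                   # host -> back queue index
        self.earliest = {}                # host -> earliest next contact time
        self.heap = []                    # (earliest time, back queue index)
        for q in range(num_back_queues):
            self._refill(q)

    def _refill(self, q):
        """Back queue q is empty: hand it to the next host lacking a queue."""
        if self.host_of[q] is not None:
            del self.table[self.host_of[q]]
            self.host_of[q] = None
        while self.front:
            v = self.front.popleft()
            host = urlsplit(v).hostname
            if host in self.table:        # host already owns a back queue
                self.queues[self.table[host]].append(v)
                continue
            self.table[host] = q          # else q now serves this host
            self.host_of[q] = host
            self.queues[q].append(v)
            heapq.heappush(self.heap, (self.earliest.get(host, 0.0), q))
            return

    def next_url(self, now):
        """Return (fetch time, URL), or None when nothing is left."""
        if not self.heap:
            return None
        ready_at, q = heapq.heappop(self.heap)   # root of the heap
        fetch_time = max(ready_at, now)          # wait if the host is not ready
        host = self.host_of[q]
        url = self.queues[q].popleft()
        self.earliest[host] = fetch_time + self.gap
        if self.queues[q]:
            heapq.heappush(self.heap, (self.earliest[host], q))
        else:
            self._refill(q)
        return fetch_time, url


urls = ["http://a.example/1", "http://a.example/2",
        "http://b.example/1", "http://c.example/1"]
frontier = BackQueues(urls, num_back_queues=2)
schedule = []
while (item := frontier.next_url(now=0.0)) is not None:
    schedule.append(item)
print(schedule)
```

In the resulting schedule the second a.example URL is pushed back by the politeness gap, while the other hosts are served immediately.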


Back queue heap

= One entry for each back queue
= The entry is the earliest time t_e at which the host corresponding to the back queue can be hit again
= This earliest time is determined from
= Last access to that host
= Any time buffer heuristic we choose

Number of back queues B

= Keep all threads busy while respecting politeness
= Mercator recommendation: three times as many back queues as crawler threads

Near-duplicate document detection

Duplicate documents


= The web is full of duplicated content 

= Strict duplicate detection = exact match 
= Not as common 

= But many, many cases of near duplicates 


= E.g., Last modified date the only difference 
between two copies of a page 

Duplicate/Near-Duplicate Detection


= Duplication: Exact match can be detected with 
fingerprints 
= Near-Duplication: Approximate match 
= Overview 


= Compute syntactic similarity with an edit-distance 
measure 
= Use similarity threshold to detect near-duplicates 
= E.g., Similarity > 80% => Documents are “near duplicates” 
= Not transitive though sometimes used transitively 

Shingles + Set Intersection

= Computing exact set intersection of shingles between all pairs of documents is expensive
= Approximate using a cleverly chosen subset of shingles from each (a sketch)
= Estimate (size_of_intersection / size_of_union) based on a short sketch

[Figure: Doc A → Shingle set A → Sketch A and Doc B → Shingle set B → Sketch B; the two sketches are compared to estimate the Jaccard coefficient]

Computing Sketch[i] for Doc1

[Figure: Document 1 on the number line up to $2^{64}$]

= Start with 64-bit f(shingles)
= Permute on the number line with $\pi_i$
= Pick the min value

Computing Similarity

= Features:
= Segments of a document (natural or artificial breakpoints)
= Shingles (Word N-Grams)
= a rose is a rose is a rose → 4-grams are

a_rose_is_a
rose_is_a_rose
is_a_rose_is
a_rose_is_a

= Similarity Measure between two docs (= sets of shingles)
= Jaccard coefficient: (Size_of_Intersection / Size_of_Union)
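
As a small sketch, word 4-gram shingles and the Jaccard coefficient can be computed directly. The second sentence is a made-up variant of the slide's example, and returning 1.0 for two empty sets is an arbitrary convention chosen here.

```python
def shingles(text, n=4):
    """Set of word n-grams of `text`, with words joined by underscores."""
    words = text.lower().split()
    return {"_".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Size of intersection divided by size of union."""
    if not a and not b:
        return 1.0  # convention chosen for this sketch
    return len(a & b) / len(a | b)


s1 = shingles("a rose is a rose is a rose")
s2 = shingles("a rose is a rose is a flower")
print(sorted(s1))       # the repeated shingle a_rose_is_a appears once
print(jaccard(s1, s2))  # 0.75: 3 shared shingles out of 4 distinct ones
```

Because a document is treated as a set, the shingle a_rose_is_a is counted once even though it occurs twice in the text.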

Sketch of a document

= Create a “sketch vector” (of size ~200) for each document
= Documents that share > t (say 80%) corresponding vector elements are deemed near duplicates
= For doc D, $\text{sketch}_D[i]$ is as follows:
= Let f map all shingles in the universe to $1..2^m$ (e.g., f = fingerprinting)
= Let $\pi_i$ be a random permutation on $1..2^m$
= Pick MIN {$\pi_i(f(s))$} over all shingles s in D

Test if Doc1.Sketch[i] = Doc2.Sketch[i]

[Figure: Document 1 and Document 2 on the number line up to $2^{64}$, each with its minimum value marked. Are these equal?]

Test for 200 random permutations: $\pi_1, \pi_2, \ldots, \pi_{200}$

However...

[Figure: Document 1 and Document 2 on the number line up to $2^{64}$, with minimum values A and B]

A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection)

Why?

Claim: This happens with probability Size of intersection / Size of union

Key Observation

= For columns $C_1$, $C_2$, four types of rows

      C_1  C_2
  A    1    1
  B    1    0
  C    0    1
  D    0    0

= Overload notation: A = # of rows of type A
= Claim

$$\mathrm{Jaccard}(C_1, C_2) = \frac{A}{A + B + C}$$
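
The claim can be checked on a small example by counting row types. The two columns below are hypothetical 0/1 vectors chosen for illustration, and the function names are invented.

```python
def row_type_counts(c1, c2):
    """Count rows of type A (1,1), B (1,0), C (0,1) and D (0,0)."""
    kinds = {(1, 1): "A", (1, 0): "B", (0, 1): "C", (0, 0): "D"}
    counts = {"A": 0, "B": 0, "C": 0, "D": 0}
    for x, y in zip(c1, c2):
        counts[kinds[(x, y)]] += 1
    return counts


def jaccard_from_columns(c1, c2):
    """A / (A + B + C); assumes at least one row is not of type D."""
    n = row_type_counts(c1, c2)
    return n["A"] / (n["A"] + n["B"] + n["C"])


# Hypothetical columns: one row per element of the universe.
C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]
print(row_type_counts(C1, C2))       # {'A': 2, 'B': 1, 'C': 2, 'D': 1}
print(jaccard_from_columns(C1, C2))  # 0.4
```

Type-D rows never enter the formula: elements absent from both sets affect neither the intersection nor the union.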

Random permutations

= Random permutations are expensive to compute
= Linear permutations work well in practice
= For a large prime p, consider permutations over {0, ..., p − 1} drawn from the set:

$$F_p = \{\pi_{a,b} : 1 \le a \le p - 1,\ 0 \le b \le p - 1\}, \quad \text{where} \quad \pi_{a,b}(x) = (ax + b) \bmod p$$
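
A sketch of how linear permutations can produce a document sketch: each permutation $\pi_{a,b}$ is applied to the fingerprints of a document's shingles and the minimum is kept. The prime $2^{61} - 1$, the SHA-1-based fingerprint, the seed, and the function names are assumptions made for illustration; the shingle sets are synthetic.

```python
import hashlib
import random

P = (1 << 61) - 1  # a large prime (the Mersenne prime 2^61 - 1), assumed choice


def fingerprint(shingle):
    """Map a shingle to an integer in 0..P-1 (assumed fingerprint function)."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % P


def make_permutations(count, seed=0):
    """Draw `count` pairs (a, b) with 1 <= a <= P-1 and 0 <= b <= P-1."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(count)]


def sketch(shingle_set, perms):
    """sketch[i] = min over shingles s of (a_i * f(s) + b_i) mod P."""
    prints = [fingerprint(s) for s in shingle_set]
    return [min((a * x + b) % P for x in prints) for a, b in perms]


def estimated_jaccard(sketch1, sketch2):
    """Fraction of sketch positions on which the two documents agree."""
    matches = sum(x == y for x, y in zip(sketch1, sketch2))
    return matches / len(sketch1)


perms = make_permutations(200)
doc1 = {"s%d" % i for i in range(0, 100)}    # synthetic shingle sets
doc2 = {"s%d" % i for i in range(50, 150)}   # true Jaccard is 50/150
estimate = estimated_jaccard(sketch(doc1, perms), sketch(doc2, perms))
print(estimate)  # should be in the neighborhood of 1/3
```

Both documents must be sketched with the same list of permutations; otherwise the positions of the two sketch vectors are not comparable.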

Set Similarity of sets $C_i$, $C_j$

$$\mathrm{Jaccard}(C_i, C_j) = \frac{|C_i \cap C_j|}{|C_i \cup C_j|}$$

= View sets as columns of a matrix A; one row for each element in the universe. $a_{ij} = 1$ indicates presence of item i in set j
= Example

[Table: example columns $C_1$ and $C_2$]

$$\mathrm{Jaccard}(C_1, C_2) = 2/5 = 0.4$$

“Min” Hashing

= Randomly permute rows
= Hash $h(C_i)$ = index of first row with 1 in column $C_i$
= Surprising Property

$$P[h(C_i) = h(C_j)] = \mathrm{Jaccard}(C_i, C_j)$$

= Why?
= Both are A/(A+B+C)
= Look down columns $C_i$, $C_j$ until first non-Type-D row
= $h(C_i) = h(C_j)$ ⟺ type A row
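
On a tiny example the property can be verified exactly by enumerating every row permutation instead of sampling. The two columns are hypothetical 0/1 vectors with Jaccard coefficient 2/5, and the function name is invented.

```python
from fractions import Fraction
from itertools import permutations


def min_hash(column, order):
    """Position, in the permuted row order, of the first row holding a 1."""
    for position, row in enumerate(order):
        if column[row] == 1:
            return position
    return None  # the column contains no 1s


# Hypothetical columns: 2 rows of type A, 1 of type B, 2 of type C, 1 of type D.
C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]

orders = list(permutations(range(len(C1))))  # all 720 row orders
agree = sum(min_hash(C1, o) == min_hash(C2, o) for o in orders)
probability = Fraction(agree, len(orders))
print(probability)  # 2/5
```

The two hashes agree exactly when the first non-type-D row in the permuted order is of type A, which happens for a fraction A/(A+B+C) of all permutations.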

Final notes

= Shingling is a randomized algorithm
= Our analysis did not presume any probability model on the inputs
= It will give us the right (wrong) answer with some probability on any input
= We’ve described how to detect near duplication in a pair of documents
= In “real life” we’ll have to concurrently look at many pairs
= See textbook for details


